Chromosome-level genome assembly of the sea cucumber Apostichopus japonicus

Sea cucumber is a morphologically diverse and ecologically important clade of echinoderms. The sea cucumber Apostichopus japonicus is the most economically valuable species of sea cucumber. The initial assembly of the A. japonicus genome was released in 2017. However, this genome assembly is fragmented and lacks relative position information of genes on chromosomes. In this study, we produced a high-quality chromosome-level genome of A. japonicus using Pacbio HiFi long-reads and Hi-C sequencing data. The assembled A. japonicus genome spanned 671.60 Mb with a contig N50 size of 17.20 Mb and scaffold N50 size of 29.65 Mb. A total of 99.9% of the assembly was anchored to 23 chromosomes. In total, 19,828 genes were annotated, and 97.2% of BUSCO genes were fully represented. This high-quality genome of A. japonicus will not only aid in the development of sustainable aquaculture practices, but also lay a foundation for a deeper understanding of their genetic makeup, evolutionary history, and ecological adaptation.


Background & Summary
The sea cucumber belongs to Echinodermata, which occupy an important phylogenetic position together with their sister phylum, Hemichordata. Evolutionary studies of sea cucumber are crucial for understanding the origin of chordates 1 . Sea cucumbers have evolved special behaviors, including super regenerative capacity, aestivation, and anatomy, among others, to adapt to various oceanic environments. This adaptation has taken place over a long evolutionary history, which can be traced back to the early Cambrian era 2 . Sea cucumbers play crucial roles in maintaining the health of the ocean floor by consuming dead plant and animal matter, which helps to keep the sediment in a balanced state 3 . In addition, sea cucumbers are important economic aquaculture species and have been deemed as one of the most valuable functional foods in the sea due to their nutritional and pharmacological properties 4,5 . Among approximately 60 species with exploitation value, Apostichopus japonicus is the most economically important species 6  For the reasons listed above, A. japonicus is one of the most studied echinoderms, and a total of 1,358 research papers on A. japonicus were published between 2000 to 2021 7 . Such research occurs in fisheries 8,9 , immunology 10,11 , food science and biological medicine 4,5 , ecological function 3 , as well as biochemistry and molecular biology [12][13][14] . With the deepening of research requirements and development of biological technologies, studies of the mechanisms on phenotype formation, physiological responses, and behavioral regulation is rapidly developing. Hence, the genetic resources of A. japonicus have been exploited, including the coding and noncoding RNA transcriptome [15][16][17] , proteome 18 , epigenome 19 , genetic linkage map 20,21 and genome 1,22 . Among them, the genome is the most basic data that is essential for most multi-omics analyses.
In www.nature.com/scientificdata www.nature.com/scientificdata/ biology. Genomes with a low contig N50 are generally highly fragmented, resulting in the poor annotation of protein-coding genes and non-coding sequences 24 . For example, the sea cucumber breeding industry is currently primarily focused on traditional breeding methods, with molecular marker-assisted breeding playing a supporting role. A. japonicus breeding is in the transition from Breeding 2.0 (Statistical and experimental design to improve selection effort) to Breeding 3.0 (Integration of genetic and genomic data) 25 . Genome-wide association studies (GWAS) and genomic selection (GS) are required to identify genes related to economic traits and accurately evaluate them in A. japonicus breeding programs. Therefore, genomes with long contig N50s and long continuity are essential. Moreover, the annotation of functional genes and regulatory elements plays important roles in understanding evolutionary mechanisms and genetic regulation, which depend on the high quality of genome.
We applied multiple sequencing technologies, generating 56.54 Gb of Illumina data, 43.22 Gb of PacBio data, and 3.18 Gb of HiC data, to reconstruct the chromosome-level A. japonicus genome (Fig. 1), which is the first known chromosome-level genome of sea cucumber ( Table 1). The final assembly was 671.60 Mb in total length with a contig N50 length of 17.20 Mb and scaffold N50 length of 29.65 Mb ( Table 1). The assembly quality was much better than those of the previous genome versions. The genome developed herein will be an excellent tool to better investigate the mechanisms that drive evolution and biodiversity 23 . By analyzing their genetic basis, researchers can identify the compounds responsible for these benefits and develop new treatments for human diseases 4 . Moreover, scientists can better understand the genetic basis of their responses to environmental stressors by studying their genomes, and develop new tools for monitoring and conserving the ocean's resources 26 . As an important aquaculture species, the accurate genetic analysis of economic traits can help to improve the genetic stability of target traits and the success rate of genetic improvement operations 20 .    Table 4. Statistical analysis of sequencing data from Hi-C.
www.nature.com/scientificdata www.nature.com/scientificdata/ Overall, constructing a high-quality genome for sea cucumbers is a crucial step in advancing the understanding of these unique animals and their importance to the health of the ocean and to human well-being.

Methods
Sample collection and sequencing. The longitudinal muscle of a female A. japonicus was collected in Rushan, Shandong Province, China, in 2021. The sample was washed three times with phosphate buffered saline (PBS), quickly frozen in liquid nitrogen, and stored at −80 °C until DNA extraction. After DNA extraction, a short fragmented library was prepared with an insert size of 350 bp and sequenced using the Illumina Platform to generate 150-bp paired-end reads. For HiFi read generation, high-molecular-weight (HMW) gDNA was sheared to approximately 15 Kb before preparing a PacBio HiFi library. The genomic library was sequenced in CCS mode on the PacBio Sequel II system at Novogene (Beijing, China). After trimming the low-quality reads and adaptor sequences from the raw data, 56.54 Gb of Illumina data and 43.22 Gb of PacBio data with a mean read length of 14.7 Kb were obtained, resulting in 83.58-fold and 63.89-fold coverage of the A. japonicus genome respectively ( Table 2). Such coverage was sufficient for haplotype-resolved assembly.
Hifiasm calculates from the uncollapsed genome, allowing it to preserve haplotype information as much as possible. The HiFi long reads were provided to Hifiasm to generate the monoploid and a pair of haplotype-resolved assembly contig graphs. We assembled 178 contigs with a total length of 671.63 Mb. The maximum contig size and N50 were 38.08 and 17.20 Mb (Table 3), respectively.
Hi-C library preparation, sequencing, and chromosome anchoring. A Hi-C library was prepared following the Hi-C library protocol 28 . After grinding with liquid nitrogen, fresh muscle was cross-linked using 4% formaldehyde solution at room temperature in a vacuum for 30 min. The fixation was terminated using 2.5 M glycine. Following cell lysis, cross-linked DNA was digested using the four-cutter restriction enzyme MboI. The DNA ends were subsequently labeled with biotin-14-dCTP and subjected to blunt-end ligation of the cross-linked fragments. DNA was extracted and purified using the phenol-chloroform extraction method. Sonication was employed to generate fragments ranging from 200 to 600 base pairs, and the ends of these fragments were repaired using a combination of T4 DNA polymerase, T4 polynucleotide kinase, and Klenow DNA polymerase. Streptavidin C1 magnetic beads were utilized for the specific enrichment of biotin-labeled Hi-C samples 29,30 . Following the addition of A-tails to the fragment ends and ligation with Illumina paired-end (PE) sequencing adapters, Hi-C sequencing libraries were subjected to PCR amplification (12-14 cycles) and subsequently www.nature.com/scientificdata www.nature.com/scientificdata/ sequenced on an Illumina PE150 platform. The raw sequence data were filtered to obtain a total of 73.50 Gb of clean data, with Q20 = 96.28% and Q30 = 91.25% (Table 4), which was used to assist chromosome assembly.
HiCUP (v0.8.1) was used to process the Hi-C data 31 . The clean Hi-C data was assembled using the ALLHiC pipeline, which contained a total of five steps: pruning, partitioning, rescuing, optimizing and building 32,33 . Finally, 99.94% of the initial assembled sequences were anchored to 23 pseudo-chromosomes ( Fig. 2) with lengths ranging from 18.21 to 46.02 Mb. The total length of the genome assembly was 671.63 Mb, with 34 scaffolds and a scaffold N50 of 29.65 Mb (Table 5).

Genomic repeat annotation and ncrNA annotation. Repeat sequences of the A. japonicus genome
were identified by homology-based and de novo strategies 34 . First, we integrated the repetitive sequence database predicted by Denovo with the homologous repetitive sequence database, Repbase 35 . Then, we used RepeatScout (v1.0.5) 34 , RepeatModeler (v2.0.1) 36 , Piler (v1.0) 37 and LTR-FINDER (v1.0.6) 38 to identify transposable element (TE) families. Repeatmasker (v4.1.0) 36 , RepeatProteinMask (v4.1.0) and TRF (v4.0.9) 39 were used to identify and classify different repetitive elements by aligning the A. japonicus genome sequences against the integrated database. After removing the redundancy results obtained using the above three methods, the total length of the repeat sequences accounted for 47.33% of the A. japonicus genome. In addition, the Kimura divergence value of TE was calculated using calcDivergenceFromalign.pl 40 . TE landscapes were drawn using createRepeatLandscape. pl 41 (Fig. 3). Among the repeat elements, short interspersed nuclear elements (SINEs) accounted for 0.02% of the genome and long interspersed nuclear elements (LINEs) accounted for 2.94% of the genome. Long terminal repeats (LTRs) and DNA elements accounted for 27.03% and 3.74% of the genome, respectively (Table 6).
For the annotation of noncoding RNA (ncRNA), tRNAScan (v1.4) 42 and blast (v2.2.26) 43 were used for tRNA and rRNA prediction, respectively. Other noncoding RNAs, including miRNA and snRNA were detected by alignment to the Rfam database 44 Table 5. Assembly statistics for Hi-C. www.nature.com/scientificdata www.nature.com/scientificdata/ Protein-coding gene prediction and annotation. Gene structures were predicted using three basic strategies: de novo, homology-based, and transcriptome sequencing-based prediction. Based on the genome sequence, we used Augustus (v3.  (Table 7). We also compared the gene, CDS, and exon and intron lengths to those of the seven other species (Fig. 4) Table 6. Classification of repetitive sequences and ncRNAs in the A. japonicus genome.

Gene set Number
Average transcript length(bp)

Average intron length(bp)
De novo  Table 7. Statistical analyses of the gene structure annotation of the A. japonicus genome.
www.nature.com/scientificdata www.nature.com/scientificdata/ The clean RNA-seq data underwent two types of assembly methods. For transcript assembly, we relied on the reference genome, while de novo assembly was carried out using Trinity (v2.11.0) 59 . Open reading frames (ORFs) were detected using PASA (v2.1.0) 60 . Based on the predictions, we used EvidenceModeler(v1.1.1) 61 to integrate the gene sets predicted using different strategies into a non-redundant and complete gene set of 19,828 protein-coding genes ( Table 7 & Fig. 5a).

Data Records
The genomic Illumina sequencing data were deposited in the SRA at NCBI SRR22523578 70 .
The transcriptomic sequencing data were deposited in the SRA at NCBI SRR17056084 74 .
The Hi-C sequencing data were deposited in the SRA at NCBI SRR23362389-SRR23362392 [75][76][77][78] . The final chromosome assembly and genome annotation files are available in Figshare 79 .

technical Validation
Evaluation of the genome assembly and annotation. We evaluated the genome assembly quality through the following measures: (i) The BUSCO (V4.1.2) 80 evaluation was performed using a single-copy orthologous gene library, combined with software tools such as tblastn, augustus, and hmmer, to assess the assembled genome. The result showed that 97.2% of gene orthologs were detected in A. japonicus. Among them, 96.7% achieved complete scores, while 0.5% obtained fragment scores. This indicates a relatively comprehensive assembly outcome (Supplementary fig. 1). (ii) Employing the Core Eukaryotic Genes Mapping Approach (CEGMA) (v2.5) 81 , we identified 458 core eukaryotic genes, including 248 highly-conserved core genes used to assess genome and annotation completeness (Supplementary table 1). By aligning A. japonicus genes to these 248 core genes, we observed homologous genes in the A. japonicus gene sets for 228 core genes, accounting for 91.94% of the total. These findings further support the relatively complete assembly results. (iii) By aligning Illumina sequencing reads to the nuclear genome using BWA (v0.7.8) 82 , we determined a read mapping rate of 96.86% and a coverage rate of 99.83%, indicating high mapping efficiency and comprehensive coverage. (iv) The consensus quality value (QV) of genomes representing per-base consensus accuracy was estimated by Merqury 83 , and the QV of the A. japonicus genome exceeded 45 (48.86), which indicated the high accuracy of the genome assembly. Thus, all of the above results indicated that we obtained the high-quality genome of A. japonicus.

Code availability
No custom scripts or code were used.  Table 8. Statistical analysis of the functional gene annotations of the A. japonicus genome.